Abstract


We describe an instrumentation of TCP/IP that monitors TCP connections and
provides values of internal variables of the implementation. We define
interface events for a TCP/IP connection, describe how traces are obtained,
and how application processes initiate trace collection. The instrumentation
has been implemented in 4.3BSD UNIX. The instrumented TCP/IP provides a
flexible environment for experimental studies. Using the instrumentation, we
have studied the performance of different roundtrip-time estimators in the
Internet environment. One conclusion of our study is that clock resolution
is an important parameter, and the resolution currently used in UNIX
implementations of TCP is woefully inadequate. Another conclusion is that,
with an adequate clock resolution, a recently proposed estimator performs
substantially better than the estimator suggested in the TCP specification.

1 Introduction


The  Transmission  Control  Protocol  (TCP)  [14]  is  a  connection-oriented, 
transport  layer  protocol  that  is  used  extensively  in  computer  networks,  both 
local-area  and  wide-area.  TCP  operates  above  the  Internet  Protocol  (IP) 
[13],  and  provides  reliable  data  transfer  service  to  application  protocols,  such 
as  file  transfer  and  remote  login. 

IP  provides  TCP  with  virtual  communication  channels  between  every 
two host computers of the network. However, the virtual channels are
unreliable, especially in a wide-area network, such as the Internet [5], where
the channels are implemented by store-and-forward routing. They can lose,
reorder and duplicate messages in transit. Furthermore, they display
congestive behavior, by which we mean that their delay and loss characteristics
depend  significantly  on  the  number  of  messages  in  transit  in  the  channel. 
Typically,  once  this  number  exceeds  a  certain  threshold,  congestion  sets  in; 
message  delays  increase  drastically  and  throughput  levels  off  or  decreases. 

To  achieve  reliable  data  transfer  over  such  virtual  channels,  TCP  uses  a 
sliding window mechanism, involving data sequence numbers, acknowledgement
messages, send and receive windows, and retransmissions. Consider
data  transfer  from  a  source  application  entity  to  a  destination  application 
entity.  Let  us  refer  to  the  TCP  entity  at  the  source  (destination)  as  the 
source  (destination)  TCP  entity. 

The  source  application  entity  periodically  produces  data  and  passes  it  to 
the source TCP entity, which assigns increasing sequence numbers to successive
data octets. The source TCP entity buffers the data octets until they
are  acknowledged  by  the  destination  TCP  entity.  The  send  window  refers 
to  the  set  of  (contiguous)  sequence  numbers  corresponding  to  the  buffered 
data.  Periodically,  the  source  TCP  entity  sends  packets,  each  containing  one 
or  more  contiguous  data  octets  accompanied  by  the  sequence  number  of  the 
first  octet  and  the  number  of  octets. 

The destination TCP entity maintains a set of (contiguous) sequence
numbers, referred to as the receive window. Data octets below the receive
window have been passed to the destination application entity. Data octets
received out of sequence but within the receive window are buffered.
Periodically, the destination TCP entity sends an acknowledgement indicating
the current receive window.

The  source  TCP  entity  maintains  an  estimator  for  the  roundtrip  time. 
When  the  source  TCP  entity  sends  a  packet,  it  starts  a  retransmission  timer 
with a timeout equal to the current value of the estimator. If the timer
expires and the packet is not yet acknowledged, the packet is retransmitted.

While  the  sliding  window  mechanism  effectively  ensures  that  data  is  not 
delivered  out  of  sequence  [18],  obtaining  good  performance  over  congestive 
channels  is  an  open  research  area  that  is  becoming  increasingly  important 
as  networks  become  larger  and  more  heterogeneous  [4,  7,  8,  11,  12,  20]. 

The performance of a TCP connection depends on various policies employed
by the TCP entities regarding transmission, retransmission, roundtrip-time
estimation, window sizes, etc. Due to the congestive nature of
the channels, there is considerable interaction between the policies and the
amount of congestion in the network. To put it another way, a TCP
implementation with bad policies not only offers low performance to its
application entities, but can also severely degrade the overall performance
of the network by introducing congestion.

To understand the behavior of such a complex system, it is essential to do
experimental work with an instrumented TCP/IP implementation. Recently,
there has been much effort in this direction [3, 4, 6, 7, 12, 17]. Cabrera et al.
[3] have studied TCP connections across two Ethernets connected by a VAX
gateway. They examine throughput versus TCP packet size. Van Jacobson
[6, 7] has studied TCP connections across two 10 Mbs Ethernets connected by
a succession of IP level links, including a bottleneck link of 230 Kbs. He has
implemented algorithms for roundtrip-time variance estimation, exponential
retransmission backoff, and slow start. Clark [4] has studied connections
across Ethernets connected by gateways, and has implemented policies that
reduce congestion. Nagle [12] has done similar work over local and wide-area
connections. Seo et al. [17] have studied the performance of SATNET,
which links the Internet in North America to European networks. SATNET
itself consists of four nodes fully interconnected by two multi-access 64 Kbs
satellite channels with a propagation delay of 0.8 seconds.

There are two facilities available in UNIX for studying network behavior.
One is the TCP trace facility, which works by setting the SO_DEBUG
option on BSD sockets. This is useful for debugging connections, but not for
gathering performance data, because it uses the kernel printf routine to print
state and packet information while processing a packet. The kernel printf
routine is not interrupt driven, and all system activities are suspended while
it is executing. This can skew the observations considerably. The other
facility is the tcpdump program, which is used for passive monitoring from
a host on the same local network as the test host. This facility does not
affect the test host, but cannot access internal parameters of TCP, such as
the send window.

Our goal was to obtain a general instrumentation of TCP/IP that would
allow us to study transient and steady-state correlations between different
parameters of interest, in both local-area and Internet environments.

Our  Instrumentation 

In this paper, we discuss an instrumentation of TCP/IP that has been
implemented in 4.3BSD UNIX² and is currently running on a VAXstation
3200³. Given a TCP connection between two applications, say a client and a
server,  our  instrumented  TCP/IP  logs  an  entry  for  every  packet  that  crosses 
an  interface.  Each  log  entry  contains  the  following  information:  the  time 
of  occurrence  as  indicated  by  a  local  clock,  values  of  different  fields  on  the 
packet,  and  current  values  of  identified  state  variables  of  the  connection.  Log 
entries  can  be  recorded  either  at  the  client  host,  or  at  the  server  host,  or  at 
both  hosts.  In  each  host,  the  log  entries  are  collected  in  a  trace.  Logging 
can  be  initiated  by  either  the  client  or  the  server,  or  by  both.  In  the  case 
of  logging  at  both  client  and  server  hosts,  one  option  is  to  include  a  unique 
transmission  number  in  each  TCP  packet  sent.  This  allows  identification  of 
lost  and  duplicate  packets. 

An  extremely  powerful  use  of  this  instrumentation  is  to  have  both  the 
client  and  the  server  on  the  same  host,  with  the  packets  being  routed  via 
one  or  more  specified  gateways.  In  this  case,  there  is  a  single  trace  for  both 
ends  of  the  connection.  From  this  trace,  we  can  obtain  parameters  such 
as  one-way  delay  of  each  packet,  number  of  packets  in  transit,  number  of 
packets  lost,  etc.,  and  study  the  evolution  of  these  parameters  with  time 
and  their  cross-correlations.  This  capability  of  the  instrumentation  appears 
to  be  unique. 

Having  both  client  and  server  on  the  same  host  has  other  advantages. 
It  avoids  the  need  for  synchronizing  the  clocks  in  two  hosts.  It  allows  us  to 
experiment  with  multi-gateway  channels  in  the  Internet  with  only  a  single 
host  running  the  instrumented  kernel. 

We have also developed a set of post-processing tools to analyze the
trace and present results in statistical and graphical forms. With these tools,
our system provides an excellent environment for performing experimental
studies. Due to the detailed information available about the behavior, this
instrumentation can be used to validate analytic models of protocol behavior
(such as in [1, 9]), which often state the dynamic properties of different
variables.

² A preliminary version was also implemented in SUN OS 3.2.

³ VAXstation is a trademark of Digital Equipment Corporation.


Evaluation  of  roundtrip-time  estimators 

The  roundtrip  time  of  a  packet  is  the  time  interval  between  sending  the 
packet  and  receiving  its  acknowledgement.  In  TCP,  the  roundtrip  times 
observed  by  a  TCP  entity  are  the  only  information  that  it  has  concerning 
the  amount  of  congestion  currently  in  the  network.  It  uses  these  roundtrip 
times  to  maintain  an  estimate  of  the  current  roundtrip  time.  When  a  packet 
is  sent,  this  estimate  is  used  to  set  the  retransmission  timeout  of  the  packet. 

Clearly, a good roundtrip-time estimator is essential for good TCP
performance. If the estimate is too high, packet losses will be detected late. As
a  result,  retransmissions  will  be  delayed  and  throughput  will  decrease.  If 
the  estimate  is  too  low,  the  TCP  entity  will  retransmit  packets  that  are  still 
in  transit.  This  may  lead  to  congestion  [12]. 

We  have  used  our  instrumented  TCP/IP  to  evaluate  the  performance  of 
different estimators. In this report, we investigate the effect of the clock
resolution used to measure the roundtrip times. We also compare the
roundtrip-time estimator suggested by Van Jacobson [7] against the one suggested in
the  TCP  specification  [14].  The  error  for  a  packet  is  defined  as  the  difference 
between  the  value  of  the  estimator  at  the  time  of  sending  the  packet  and  the 
roundtrip  time  experienced  by  the  packet.  The  sample  standard  deviation 
of  these  errors  is  the  metric  we  use  to  evaluate  the  estimator. 

Organization  of  the  rest  of  the  paper 

In  Section  2,  we  discuss  the  design  issues  involved  in  instrumenting  a 
TCP/IP  implementation.  In  Section  3,  we  discuss  the  UNIX  implementation 
of  the  instrumentation.  In  Section  4,  we  discuss  some  experiments.  In 
Section  5,  we  conclude  and  suggest  future  extensions  of  this  work. 
2 Instrumentation of TCP/IP

Figure 1 illustrates the protocol organization between two hosts A and B
connected via the TCP/IP protocol. APP_A, TCP_A, and IP_A are the
Application, TCP, and IP entities in host A, respectively. The entities in host
B are organized similarly. These entities define three interfaces, namely,
APP/TCP, TCP/IP, and IP/Network. Packets can cross an interface in
either direction. The natural time to collect information is when a packet
crosses an interface.


[Figure: HOST_A and HOST_B connected through the NETWORK]

Figure 1: Organization of a TCP connection


2.1  Data  Logging 

Most  application  entities  communicate  according  to  the  client-server  model. 
In this model, an application entity is either a server or a client. Servers
provide a service (e.g. file transfer) to clients. Only clients can initiate
requests  for  service. 

An application entity can request either local logging or two-host logging.
In local logging, only packet crossings on local host interfaces are logged. In
two-host logging, in addition to logging at the local host, the remote host is
requested to start logging at its interfaces. This request can be conveyed by
sending a transmission number in the TCP option list.

Successive  TCP  packets  (including  retransmissions)  have  consecutively 
increasing transmission numbers, starting with 1. The transmission number
is sent only if at least one of the applications has requested two-host
logging.  On  receiving  a  packet  with  a  transmission  number  in  it,  the  TCP 
entity  starts  logging  for  that  connection  and  begins  to  include  transmission 
numbers  in  outgoing  packets. 

A  special  case  of  two-host  logging  is  to  have  both  the  client  and  the 
server  on  the  same  host,  with  the  packets  of  the  connection  being  routed  via 
one  or  more  specified  gateways  (using  the  IP  LSRR  option  [10]). 

2.2  Format  of  a  Log  Entry 

A log entry is made when a packet crosses an interface. Every log entry
contains a timestamp obtained from a clock in the host, source and destination
port  numbers,  and  the  transmission  number.  Additional  fields  in  the  log 
entry  depend  on  the  interface  at  which  it  is  logged  and  are  described  below: 

Application/TCP interface:

•  Number of outstanding octets (i.e., number of octets given by the
   application that have not been acknowledged)

TCP/IP interface:

•  Fields from the packet:

      send sequence number
      acknowledgement sequence number
      receive window size
      packet size
      packet header size
      TCP header flags
      send window size

•  Outstanding data in the connection at that host.

IP/Network interface:

•  Fields from the packet:

      IP time to live
      IP header length
      IP packet length


The trace of a connection contains (arguably) all the information needed
for analysis. From it, we can extract the values of state variables at different
instants, study relationships between them, and obtain performance
measures.

For example, a packet is considered lost if there is a log record indicating
that it was sent but none indicating that it has been received. The number of
times an octet has been retransmitted can be obtained by scanning the log
records of send events. The throughput of a connection is the number of
octets sent, divided by the total time of the connection.
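
The measures described above can be obtained in a single pass over a trace. The following Python sketch assumes a simplified, hypothetical record layout (the field names `time_ms`, `event`, `trnum`, `seq`, and `length`, and the event names "send" and "recv", are illustrative and are not the actual log format). It also assumes that "octets sent" in the throughput definition means distinct data octets.

```python
from collections import Counter
from typing import NamedTuple

class LogRecord(NamedTuple):
    time_ms: int   # timestamp from the host clock, in milliseconds
    event: str     # "send" or "recv" (hypothetical event names)
    trnum: int     # transmission number of the packet
    seq: int       # sequence number of the first data octet
    length: int    # number of data octets carried by the packet

def lost_packets(trace):
    """Transmission numbers that have a send record but no receive record."""
    sent = {r.trnum for r in trace if r.event == "send"}
    received = {r.trnum for r in trace if r.event == "recv"}
    return sorted(sent - received)

def octet_send_counts(trace):
    """How many times each data octet appears in a send record."""
    counts = Counter()
    for r in trace:
        if r.event == "send":
            counts.update(range(r.seq, r.seq + r.length))
    return counts

def retransmission_counts(trace):
    """Octets sent more than once, mapped to their number of retransmissions."""
    return {octet: n - 1 for octet, n in octet_send_counts(trace).items() if n > 1}

def throughput(trace):
    """Distinct octets sent per second over the lifetime of the trace."""
    duration_s = (trace[-1].time_ms - trace[0].time_ms) / 1000.0
    return len(octet_send_counts(trace)) / duration_s
```

Matching on the transmission number rather than the sequence number is what lets a retransmission be told apart from the original packet.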

2.3  Implementation  Issues 

A  major  requirement  of  the  instrumentation  is  that  it  should  have  minimal 
effect  on  the  results. 

A  log  entry  is  appended  to  the  trace  every  time  a  packet  crosses  an 
interface  between  the  two  entities.  To  minimize  the  effect  of  logging,  the  log 
entry  for  a  packet  is  made  after  the  packet  has  been  sent. 

Because the number of packets sent in a connection can be large, the size
of the trace can exceed the size of physical memory. However, we cannot
allow  the  TCP  or  IP  entities  to  append  log  entries  to  a  disk  file,  because 
that  would  be  very  slow,  thereby  affecting  the  experiment.  Our  choice  was 
to  append  the  log  entries  to  a  buffer  in  physical  memory.  A  reader  process 
periodically  transfers  these  entries  to  a  disk  file. 

Access to the shared memory by the TCP entity and the reader process
has to be mutually exclusive. We try to keep the time spent in the critical
section to a minimum. Our method is to have a linked list of buffers, with the
critical section involving only the modification of pointers. The reading and
writing of the buffers is done outside the critical section. If there is no empty
buffer available, TCP and IP do not make a log entry. This avoids blocking when
the reader process is slow.⁴
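
The buffering scheme can be illustrated with a small user-level sketch. The Python code below is not the kernel implementation: the class and method names are hypothetical, a `threading.Lock` stands in for the critical section, and `deque` objects stand in for the linked lists. It shows the two properties described above: only list manipulation happens under the lock, and a writer drops the entry instead of waiting when no empty buffer is available.

```python
import threading
from collections import deque

class Record:
    """One buffer, able to hold a single log entry."""
    __slots__ = ("entry",)

    def __init__(self):
        self.entry = None

class LogBufferPool:
    def __init__(self, n_records):
        self._lock = threading.Lock()       # stands in for the critical section
        self._empty = deque(Record() for _ in range(n_records))
        self._full = deque()
        self.dropped = 0                    # entries discarded for lack of a buffer

    def log(self, entry):
        """Writer side: never waits for a free buffer; drops the entry instead."""
        with self._lock:                    # critical section: unlink a buffer
            if not self._empty:
                self.dropped += 1
                return False
            rec = self._empty.popleft()
        rec.entry = entry                   # buffer is written outside the lock
        with self._lock:                    # critical section: link the buffer
            self._full.append(rec)
        return True

    def drain_one(self):
        """Reader side: return the oldest entry, or None if nothing is logged."""
        with self._lock:
            if not self._full:
                return None
            rec = self._full.popleft()
        entry = rec.entry                   # buffer is read outside the lock
        rec.entry = None
        with self._lock:
            self._empty.append(rec)
        return entry
```

With a pool of two buffers, a third `log()` call made before any `drain_one()` returns `False` and increments `dropped`, which mirrors the decision to lose log entries instead of delaying packet processing.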

The  logging  of  a  connection  should  not  affect  other  connections  that 
have  not  opted  for  logging.  In  our  implementation,  we  set  a  flag  for  each 
connection  for  which  logging  is  desired.  No  logging  is  done  if  this  flag  is  not 
set. 


⁴ Also, the user may not have started the reader process.

3 Implementation under UNIX


In 4.3BSD UNIX, the TCP/IP routines are part of the kernel. Here we
describe briefly the modifications that we have made to the kernel. A detailed
description  can  be  found  in  [15]. 

The TCP and IP entities write their log entries in main memory. For
this purpose, a kernel memory area that is accessible to the TCP and IP
routines is required. The tcp_init() routine, which is executed as part
of the kernel initialization procedure, has been modified to allocate a block of
memory. This block is organized into two linked lists of records: the empty
list and the full list. Each record can hold one log entry. Initially, all the
records are in the empty list.

When  a  packet  crosses  an  interface,  the  modified  TCP  and  IP  routines 
write  a  log  entry  in  an  empty  record,  and  append  it  to  the  full  list.  If  there 
is  no  record  in  the  empty  list,  no  log  entry  is  appended. 

There is a reader process that reads log entries from the memory and
writes them to a disk file. The reader process views the memory as a
read-only device called netlog. A device driver has been written for this
pseudo-device.

The reader process is started at the beginning of the experiment and runs
throughout the experiment. It employs blocking I/O so that it is suspended
when there are no records in the full list. It is woken up by the TCP and IP
entities when they append a log entry to the full list. The TCP/IP entities
and the netlog device driver ensure that accesses to the full and empty lists
are mutually exclusive by raising the priority of the CPU.
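
The blocking behavior of the reader can be sketched at user level with a condition variable. This is only an analogy to the kernel mechanism: the names below are hypothetical, `threading.Condition` replaces the kernel's sleep and wakeup, and the `close()` method is an addition so that the example terminates.

```python
import threading
from collections import deque

class NetlogQueue:
    """User-level stand-in for the full list read through the netlog device."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = deque()
        self._closed = False

    def append(self, entry):
        """Called by the logging side; wakes the reader if it is suspended."""
        with self._cond:
            self._full.append(entry)
            self._cond.notify()

    def close(self):
        """End of experiment (not part of the text's design)."""
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def read(self):
        """Suspend until an entry exists; return None once closed and drained."""
        with self._cond:
            while not self._full and not self._closed:
                self._cond.wait()
            return self._full.popleft() if self._full else None

def reader(queue, sink):
    """Reader process: copy entries to `sink` (a list standing in for the disk file)."""
    while True:
        entry = queue.read()
        if entry is None:
            break
        sink.append(entry)
```

The reader consumes no CPU time while the full list is empty, which matters because it shares the host with the connection under test.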

The  traces  of  all  the  connections  are  written  in  the  same  file.  The  trace 
for  a  particular  connection  can  be  extracted  during  post  processing. 

3.1  Application  Interface 

An  application  entity  performs  different  activities  to  establish  a  connection, 
depending  on  whether  it  is  a  client  or  a  server  [10].  A  server  entity  executes 
the  following  steps: 

S1: Inform the local TCP entity of its willingness to provide service by
creating a socket.

S2: Inform the TCP entity that it is ready to receive service requests.

S3: Wait for an incoming connection request from a client entity.

S4: Service the connection until termination.

A client entity executes the following steps:

C1: Inform the local TCP entity of its need to get service. A socket is
created for the client.

C2: Request connection to the server.

C3: Once the connection is established, the client may begin requesting
service.

UNIX provides the setsockopt() call for applications to set different
socket options. We have modified the setsockopt() call such that the logging
option can also be set by an application entity.
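
The server and client steps map directly onto the socket calls. The sketch below runs both ends on the loopback interface in Python. The name and value of the logging option are not given in the text, so the standard `SO_REUSEADDR` option is set purely as a stand-in to show where a `setsockopt()` call for logging would go; the upper-casing "service" is likewise invented for the example.

```python
import socket
import threading

def serve_one(listener):
    conn, _addr = listener.accept()              # S3: wait for a connection request
    with conn:                                   # S4: service until termination
        while True:
            data = conn.recv(1024)
            if not data:
                break
            conn.sendall(data.upper())

def run_once(message):
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # S1: create a socket
    # Stand-in option: an instrumented kernel would take its logging
    # option at this point through the same call.
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", 0))              # port 0: let the system choose a port
    listener.listen(1)                           # S2: ready to receive requests
    port = listener.getsockname()[1]
    worker = threading.Thread(target=serve_one, args=(listener,))
    worker.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)     # C1: create a socket
    client.connect(("127.0.0.1", port))          # C2: request connection
    client.sendall(message)                      # C3: request service
    client.shutdown(socket.SHUT_WR)              # no more requests
    reply = b""
    while True:
        chunk = client.recv(1024)
        if not chunk:
            break
        reply += chunk
    client.close()
    worker.join()
    listener.close()
    return reply
```

Because the option is set on the socket before the connection exists, either end can request logging without changing the data-transfer code.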

For  each  connection,  UNIX  maintains  a  number  of  data  structures  to 
support  inter-process  communication.  Here,  we  mention  the  ones  relevant 
to  our  discussion.  For  each  connection  in  the  system,  three  structures,  called 
tcpcb,  inpcb,  and  socket  are  maintained.  Tcpcb  contains  the  values  of  TCP 
state  variables.  Inpcb  contains  the  protocol  independent  information  like 
routing  entry  and  the  IP  options.  Socket  has  pointers  to  send  and  receive 
buffer  queues.  These  structures  have  pointers  to  each  other.  The  inpcbs 
of  all  the  TCP  connections  in  the  system  are  linked  in  a  list.  We  keep 
the  transmission  number  for  a  connection  in  a  separate  mbuf  (the  unit  of 
memory  buffer  in  the  UNIX  kernel),  which  is  accessed  through  a  pointer 
in  tcpcb.  Recall  that  the  transmission  number  is  used  to  uniquely  identify 
packets  (see  ‘Our  Instrumentation’  subsection  in  Section  1). 

3.2  Modifications  to  TCP/IP  Routines 

The TCP/IP routines have been modified to append log entries to the kernel
memory area. In our current implementation, we have instrumented
the  TCP/IP  and  the  IP/Network  interfaces.  Here  we  briefly  describe  the 
modifications  that  have  been  made  to  the  TCP/IP  routines. 

Packet from TCP to IP: The tcp_output() routine takes the data to
be sent from the socket queues. It appends the TCP header to the data and
passes the packet to IP through a call to ip_output(). The tcp_output()
routine has been modified to append a log entry at this stage. The timestamp
is obtained just before the call is made. The log entry is appended after the
call returns, thereby avoiding a delay (due to logging) in sending the packet.

Packet from IP to Network: IP receives a packet from TCP through
the ip_output() routine. This routine has been modified to append a log
entry just before it makes a call to the network interface driver. It decides
whether or not to log by scanning the flags passed to it by tcp_output().

Packet from Network to IP: The routine that handles incoming packets
for IP is ip_intr(). It removes the packet from the queue and determines
whether the packet is destined for the local host or is to be forwarded to
another host. In the former case, it passes the packet to the upper layer
protocol. The ip_intr() routine has been modified to append a log entry just
before calling the upper layer protocol. The timestamp for this entry is
taken at the beginning of the routine. To decide whether the connection
to which the packet belongs has the logging option set, the tcpcb of this
connection is examined.

Packet from IP to TCP: The tcp_input() routine processes an incoming
packet for TCP. It calls in_pcblookup() to determine which connection
the packet should go to. The tcp_input() routine has been modified to
append a log entry if that connection has the logging option set.

Two  special  cases  arise  at  this  stage. 

(a) When a SYN packet is received for a socket, TCP creates new instances
of the socket, the inpcb, and the tcpcb data structures for the new
connection. This portion of tcp_input() has been modified to determine
whether the parent socket had the logging option set. If it did, then
tcp_input() sets the option for the new socket as well.

(b) If a packet is received with the transmission number in the TCP
options, the modified tcp_input() routine sets the two-host logging option
for the connection.

The Transmission Number: Conventional TCP uses only one option,
TCP_MAXSEG, indicating the maximum segment size. This is sent along
with the SYN packets that the two hosts exchange while establishing a
connection. We have introduced another option called TCP_TRNUM for the
transmission number. This option is sent on every packet of a connection
that has the two-host logging option set. The tcp_output() routine has been
modified to send the TCP_TRNUM option.

The tcp_dooptions() routine processes the options in an incoming TCP
packet. This routine has been modified to recognize the TCP_TRNUM
option. If a transmission number is present and the two-host logging option
has not already been set for the connection, then this routine sets the option.
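
To make the option handling concrete, the following Python sketch encodes and parses a TCP option list in the standard kind/length/value layout. The option kind chosen for the transmission number (253) and its 4-octet value width are assumptions made for illustration; the text does not state the values used in the kernel.

```python
import struct

TCPOPT_EOL = 0        # end of option list
TCPOPT_NOP = 1        # one-octet padding
TCPOPT_MAXSEG = 2     # maximum segment size
TCPOPT_TRNUM = 253    # hypothetical kind for the transmission number

def encode_options(mss=None, trnum=None):
    """Build an option list, padded to a multiple of 4 octets."""
    out = b""
    if mss is not None:
        out += struct.pack("!BBH", TCPOPT_MAXSEG, 4, mss)
    if trnum is not None:
        out += struct.pack("!BBI", TCPOPT_TRNUM, 6, trnum)
    out += bytes([TCPOPT_EOL]) * (-len(out) % 4)
    return out

def parse_options(data):
    """Return the recognized options as a dict; unknown kinds are skipped."""
    opts = {}
    i = 0
    while i < len(data):
        kind = data[i]
        if kind == TCPOPT_EOL:
            break
        if kind == TCPOPT_NOP:
            i += 1
            continue
        if i + 1 >= len(data):
            break                                   # truncated option
        length = data[i + 1]
        if length < 2 or i + length > len(data):
            break                                   # malformed option
        body = data[i + 2:i + length]
        if kind == TCPOPT_MAXSEG and length == 4:
            opts["mss"] = struct.unpack("!H", body)[0]
        elif kind == TCPOPT_TRNUM and length == 6:
            opts["trnum"] = struct.unpack("!I", body)[0]
        i += length
    return opts
```

A receiver that finds `"trnum"` in the parsed options would then set the two-host logging option for the connection, as described above.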
4 Evaluating Roundtrip time Estimators

We have performed a number of experiments using our instrumented TCP/IP.
In this section, we present some results to demonstrate the capabilities
of our instrumentation and to compare different roundtrip-time estimators.

The TCP implementation in 4.3BSD UNIX maintains several variables
for setting the retransmission timeout of a packet, namely: SRTT,
RTTVAR, RXT, Roundtrip_Timer, and Retransmission_Timer. SRTT is
the "smoothed" average of measured roundtrip times. RTTVAR is the
"smoothed" variance of measured roundtrip times. RXT is the current
retransmission-timeout estimate. Roundtrip_Timer is used to measure the
roundtrip time of one packet. Retransmission_Timer is used to indicate when to
retransmit.

When a packet is transmitted for the first time (i.e., contains no octet
that has been transmitted already) and Roundtrip_Timer is not active, TCP
records the sequence number of the first byte of the packet and starts the
timer. Every 500 ms, a software clock interrupt increments Roundtrip_Timer
by 1.⁵ When an acknowledgement is received for that packet, the roundtrip
time, denoted RTT, for that packet equals the value of Roundtrip_Timer
multiplied by 500 ms. If the packet is retransmitted before its
acknowledgement is received, the roundtrip-time measurement is aborted.
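
The effect of the tick size on a single measurement can be seen with a short calculation. The Python sketch below assumes that the counter starts at zero when the packet is sent and that interrupts occur at exact multiples of the tick; both are simplifying assumptions, and the function name is illustrative.

```python
def measured_rtt_ms(send_ms, ack_ms, tick_ms=500):
    """RTT seen by a counter that is incremented at every multiple of tick_ms.

    Counts the clock interrupts that fall after the send time and at or
    before the acknowledgement time, then converts the count to milliseconds.
    """
    ticks = ack_ms // tick_ms - send_ms // tick_ms
    return ticks * tick_ms

# The same packets measured with two clock resolutions (times in ms).
for send, ack in [(120, 480), (480, 520), (10, 1210)]:
    coarse = measured_rtt_ms(send, ack, tick_ms=500)
    fine = measured_rtt_ms(send, ack, tick_ms=10)
    print(f"true={ack - send:5d}  500ms-clock={coarse:5d}  10ms-clock={fine:5d}")
```

Under these assumptions a true roundtrip time of 360 ms can be measured as 0 ms and one of 40 ms as 500 ms, depending only on where the packet falls relative to the clock interrupts.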

Each time an RTT is obtained, three of the above variables are updated
as follows (this update scheme was introduced by Van Jacobson [7] and
differs from the one suggested in the TCP specification [14]):

$$\mathrm{SRTT}_{new} = \alpha\,\mathrm{SRTT} + (1 - \alpha)\,\mathrm{RTT}$$

$$\mathrm{RTTVAR}_{new} = \alpha'\,\mathrm{RTTVAR} + (1 - \alpha')\,(|\mathrm{RTT} - \mathrm{SRTT}| - \mathrm{RTTVAR})$$

$$\mathrm{RXT}_{new} = \mathrm{SRTT}_{new} + 2\,\mathrm{RTTVAR}_{new}$$

TCP uses the values $\alpha = 7/8$ and $\alpha' = 6/8$.
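
As a worked illustration, the Python sketch below transcribes the three update rules term by term as they appear in this section, including the RTTVAR term subtracted inside the parentheses. It is a reading of the text, not the kernel code, which uses scaled integer arithmetic; the class name and the initial values in the usage example are hypothetical.

```python
from dataclasses import dataclass

ALPHA = 7 / 8        # gain applied to the old SRTT
ALPHA_VAR = 6 / 8    # gain applied to the old RTTVAR (alpha' in the text)

@dataclass
class Estimator:
    srtt: float      # smoothed roundtrip time, in ms
    rttvar: float    # smoothed variance term, in ms

    @property
    def rxt(self):
        """Retransmission-timeout estimate: SRTT + 2 RTTVAR."""
        return self.srtt + 2 * self.rttvar

    def update(self, rtt):
        """Apply one RTT measurement and return the new RXT."""
        deviation = abs(rtt - self.srtt)            # uses SRTT before the update
        new_srtt = ALPHA * self.srtt + (1 - ALPHA) * rtt
        new_rttvar = (ALPHA_VAR * self.rttvar
                      + (1 - ALPHA_VAR) * (deviation - self.rttvar))
        self.srtt, self.rttvar = new_srtt, new_rttvar
        return self.rxt
```

For example, starting from `Estimator(srtt=1000, rttvar=200)`, a measurement of 600 ms moves SRTT to 950 ms and leaves RTTVAR at 200 ms, so RXT becomes 1350 ms.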

When a packet is sent and Retransmission_Timer is not active, TCP sets
it to the current value of RXT. Every 500 ms, a software clock interrupt (the
same one that increments the active roundtrip timers) decrements the active
retransmission timers of all TCP connections on that host. If the packet is
not acknowledged before its Retransmission_Timer becomes zero, the packet
is retransmitted and the timer is set with a value equal to twice the previous
timeout value. If the packet is acknowledged before the timer becomes zero,
the timer is reset to the current value of RXT if and only if there is still
some outstanding packet.

⁵ Actually, it increments the active roundtrip timers of all TCP connections
on that host.

From a trace, we can compute the roundtrip time of each packet. Using
these, we can simulate the effect of different RXT functions. There is
an  assumption  underlying  our  treatment;  namely,  that  the  roundtrip  times 
experienced  by  the  packets  would  remain  the  same.  In  reality,  a  different 
RXT  function  can  cause  packet  transmission  times  to  be  different  from  those 
in  the  trace.  This  in  turn  can  affect  the  network  congestion  and  therefore 
the  roundtrip  times  of  the  packets.  Our  assumption  corresponds  to  ignoring 
this  feedback  effect.  Certainly  our  assumption  would  be  valid  in  situations 
of  low  user  load. 

We now identify the packets whose roundtrip times are used in simulating
the TCP RXT functions. First, we point out that TCP only uses the
roundtrip times of packets that were not retransmitted⁶. Thus, let
$p_1, \ldots, p_N$ be the sequence of such packets sent in the connection.
From the trace, we can obtain the transmission time, $s_i$, and the
acknowledgement time, $a_i$, for each $p_i$. We have $RTT_i = a_i - s_i$.
Second, recall that a TCP entity uses only one retransmission timer and one
roundtrip timer. This means that only the $RTT_i$'s of non-overlapping
packets are used in simulating an RXT, where $p_i$ overlaps with $p_j$ if and
only if $s_i < s_j < a_i$.
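
The selection of non-overlapping packets amounts to a single scan over the packets in order of transmission time. The Python sketch below is one way to express it, assuming the input has already been restricted to packets that were not retransmitted; the function name and the tuple representation are illustrative.

```python
def timed_packets(packets):
    """Select the packets that a single roundtrip timer would measure.

    `packets` is a list of (s_i, a_i) pairs, sorted by transmission time s_i,
    for packets that were not retransmitted. A packet is selected only if
    the timer is idle when it is sent, i.e. it does not overlap the packet
    currently being timed.
    """
    selected = []
    timer_free_at = float("-inf")       # time at which the timer becomes idle
    for s, a in packets:
        if s >= timer_free_at:
            selected.append((s, a))
            timer_free_at = a           # timer runs until the acknowledgement
    return selected

def roundtrip_times(packets):
    """RTT_i = a_i - s_i for the selected packets."""
    return [a - s for s, a in timed_packets(packets)]
```

Given packets sent at 0, 100, 300, 650, and 700 ms and acknowledged at 300, 350, 700, 900, and 1000 ms, only the first, third, and fifth are timed, because the second and fourth are sent while the timer is still running.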

Finally, we define the metrics used in evaluating an RXT function.

•  Mean Square Error

   MSE =

   where $e_i = RXT_i - RTT_i$ and $RXT_i$ is the retransmission-timeout
   estimate at the time packet $p_i$ is sent.

•  Mean Square Error of the Under-estimations

   MSE- =

   where i ranges over the packet numbers for which $e_i < 0$. (Packet
   numbers are the same as the transmission numbers defined earlier.)

•  Mean Square Error of the Over-estimations

   MSE+ =

   where i ranges over the packet numbers for which $e_i > 0$.

⁶ A packet is considered retransmitted if even one octet in this packet is retransmitted.

MSE, MSE-, and MSE+ indicate how close the roundtrip-time estimates
are  to  the  actual  roundtrip  times  of  packets.  A  high  value  of  MSE-  implies  a 
large  number  of  unnecessary  retransmissions.  A  high  value  of  MSE+  implies 
large  delays  in  retransmissions  of  lost  packets,  resulting  in  under-utilization 
of  the  network. 
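
The three metrics can be computed from the per-packet errors in a few lines. The displayed formulas are not reproduced in this section, so the Python sketch below rests on an assumed reading: each metric is the square root of the sum of squared errors over the relevant packets, divided by the total number of packets N. Both the square root and the normalization by N (instead of by the size of each subset) are assumptions, chosen so that the results are in milliseconds and the squares of MSE- and MSE+ add up to the square of MSE.

```python
import math

def error_metrics(rxt, rtt):
    """Return (MSE, MSE-, MSE+) under the assumed root-mean-square reading.

    rxt[i] is the timeout estimate when packet i is sent and rtt[i] is the
    roundtrip time that the packet experienced, both in milliseconds.
    """
    errors = [x - t for x, t in zip(rxt, rtt)]      # e_i = RXT_i - RTT_i
    n = len(errors)
    if n == 0:
        raise ValueError("at least one packet is required")

    def rms(selected):
        # Normalized by the total packet count n (assumption, see above).
        return math.sqrt(sum(e * e for e in selected) / n)

    under = [e for e in errors if e < 0]            # estimate too low
    over = [e for e in errors if e > 0]             # estimate too high
    return rms(errors), rms(under), rms(over)
```

Keeping the under-estimations and over-estimations separate matters because they have different costs: the former cause unnecessary retransmissions, the latter delay recovery from losses.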

Experiments 

In each experiment that we describe here, there were two application
processes, a data source and a data sink. (See Figure 1.) Both processes
were on the host huginn.cs.umd.edu (which is a VAXstation 3200) at the
Computer Science Department at Maryland. All packets and acknowledgements
were routed via ucbvax.berkeley.edu at the University of California,
Berkeley.

In  experiment  1,  the  source  generated  1  octet  of  data  every  second  for 
1000  seconds  (for  a  total  of  1000  octets).  This  experiment  was  carried  out 
at  night  when  the  network  load  is  typically  low. 

In experiment 2, the source generated 1000 octets of data 1000 times
(for a total of 10^6 octets). The data was generated as fast as the local TCP
entity could accept it. This experiment was done during the day when the
network load is typically higher. For timestamping the log records, we used
the UNIX internal clock, with a resolution of 10 ms.

Experiment  1:  Round  trip  times  using  500  ms  resolution 

Figure 2 shows the RTTs, SRTT and RXT in experiment 1. The
x-coordinate is the packet number. Each dot (.) represents an RTT
measurement. Each asterisk (*) represents a packet lost in transit. Eight
packets were lost in transit. The values of SRTT and RXT were calculated
assuming the 500 ms clock resolution used conventionally by TCP for RTT
measurements.
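
To illustrate why a coarse clock distorts RTT samples, the sketch below models a clock that only advances in whole ticks. The function name and the tick-counting model are assumptions made for illustration, not the measurement code of any particular TCP implementation.

```python
def measured_rtt_ms(send_ms, ack_ms, resolution_ms):
    """RTT as seen by a clock that advances only every resolution_ms.

    Assumed model: the measured RTT is the number of clock ticks that
    elapse between transmission and acknowledgement, times the tick length.
    """
    ticks = ack_ms // resolution_ms - send_ms // resolution_ms
    return ticks * resolution_ms


if __name__ == "__main__":
    # Two packets with the same true RTT of 200 ms.
    print(measured_rtt_ms(130, 330, 500))  # 0: no tick boundary crossed
    print(measured_rtt_ms(430, 630, 500))  # 500: one tick boundary crossed
    print(measured_rtt_ms(130, 330, 10))   # 200
```

With a 500 ms tick, identical roundtrip times can be recorded as either 0 or 500 ms depending on where they fall relative to the tick boundaries; with a 10 ms tick the sample stays close to the true value.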

Note  that  there  is  only  one  packet  (number  683)  whose  RTT  (1200  ms) 
exceeds  the  RXT  value  (1000  ms)  at  the  time  of  its  transmission.  We  notice 
from  Figure  2  that  TCP  greatly  overestimates  the  roundtrip  time.  The 
values  of  MSE,  MSE-  and  MSE+  are  465,  6  and  465  respectively  (also 
shown  in  Table  1). 

Experiment  1:  Round  trip  times  assuming  10  ms  resolution 

We  want  to  study  the  effect  of  increasing  the  clock  resolution  on  the  TCP 
roundtrip-time  measurements.  Figure  3  shows  the  RTTs  of  experiment  1, 
and  the  values  of  SRTT  and  RXT  assuming  a  10  ms  clock  resolution  for 
RTT  measurements.  Note  that  our  assumption  that  there  is  no  feedback 
effect  is  valid  in  this  experiment  because  the  packets  are  spaced  1  second 
apart. Therefore, there is no interference between two successive packets.

The values of MSE, MSE- and MSE+ are 102, 26 and 98 respectively.
It is clear from Figures 2 and 3 and the values of MSE+ in Table 1 that
RXT values are much closer to the RTTs if a 10 ms clock resolution is used.


clock res.    MSE (-, +)
500 ms        465 (6, 465)
10 ms         102 (26, 98)

Table 1: Experiment 1 with different clock resolutions

Experiment  2:  Round  trip  times  using  500  and  10  ms  resolution 
Figure  4  shows  the  RTTs  for  experiment  2,  and  values  of  SRTT  and  RXT 
assuming  a  500  ms  clock  resolution  for  RTT  measurements.  15  packets  were 
lost  in  transit  in  this  experiment.  Figure  5  shows  the  RTTs  for  experiment 
2,  and  values  of  SRTT  and  RXT  assuming  a  10  ms  clock  resolution  for  RTT 
measurements.  Here,  our  assumption  of  ignoring  the  feedback  effect  may 
not  be  valid. 

The error metrics for these simulations are given in Table 2. We again
notice that an increased clock resolution brings the RXT values much closer
to the RTTs.


clock res.    MSE (-, +)
500 ms        785 (0, 785)
10 ms         136 (28, 134)

Table 2: Experiment 2 with different clock resolutions

Changing  estimator  parameters 

The value of α controls how rapidly SRTT adjusts to changing network
conditions. A smaller value of α allows SRTT to adapt more swiftly. We next
simulate the TCP RXT estimator with different values of α, with α' = α,
and using both 500 ms and 10 ms clock resolutions. Table 3 shows the
values of the error metrics for different values of α for experiment 1. Figure
6 shows these values graphically. Table 4 shows the values of the error
metrics for different values of α for experiment 2. Figure 7 shows these
values graphically.
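
The role of the smoothing gain (written α here) can be sketched with a small simulation. This is a minimal sketch, assuming the common mean-deviation form of the estimator; the function name, the initialization from the first sample, and the deviation multiplier k are hypothetical choices made for illustration and are not taken from this report.

```python
def simulate_rxt(rtts, alpha=0.875, alpha_dev=0.875, k=2.0):
    """Return the RXT value in force when each packet is sent.

    Assumed update rules (illustrative only):
        SRTT <- alpha * SRTT + (1 - alpha) * RTT
        DEV  <- alpha_dev * DEV + (1 - alpha_dev) * |RTT - SRTT|
        RXT  =  SRTT + k * DEV
    """
    srtt = float(rtts[0])  # assumption: start from the first sample
    dev = 0.0
    rxts = []
    for rtt in rtts:
        rxts.append(srtt + k * dev)  # estimate used for this packet
        err = rtt - srtt
        srtt = alpha * srtt + (1 - alpha) * rtt
        dev = alpha_dev * dev + (1 - alpha_dev) * abs(err)
    return rxts


if __name__ == "__main__":
    samples = [100, 200, 200, 200]
    print(simulate_rxt(samples, alpha=0.875, alpha_dev=0.875))
    print(simulate_rxt(samples, alpha=0.5, alpha_dev=0.5))
```

With the smaller gain, the estimate moves toward the new RTT level in fewer samples, which matches the statement that a smaller α lets SRTT adapt more swiftly.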

We observe that with a clock resolution of 10 ms, MSE and MSE+ remain
approximately the same for different values of α. With a clock resolution of
500 ms, MSE and MSE+ decrease as the value of α decreases.


(one column per value of α; the column headings are illegible)

clock res.    MSE (-, +)       MSE (-, +)       MSE (-, +)       MSE (-, +)
500 ms        463 (6, 463)     363 (32, 361)    362 (32, 360)    84 (72, 44)
10 ms         102 (25, 99)     101 (27, 97)     106 (28, 102)    111 (30, 107)

Table 3: Experiment 1 with different values of α (α' = α)


(one column per value of α; the column headings are illegible)

clock res.    MSE (-, +)       MSE (-, +)       MSE (-, +)       MSE (-, +)
500 ms        840 (0, 840)     509 (1, 509)     425 (1, 425)     258 (140, 217)
10 ms         139 (26, 137)    136 (29, 133)    140 (31, 136)    142 (32, 138)

Table 4: Experiment 2 with different values of α (α' = α)


Increasing  the  number  of  packets  whose  RTT  is  measured 

Recall that TCP does not measure roundtrip times of overlapping
packets. We now simulate the TCP RXT estimator assuming TCP measures
roundtrip times of all packets that are not retransmitted and whose
acknowledgements are not lost.

Figure  8  shows  the  values  of  SRTT  and  RXT  for  experiment  1  under 
this  assumption  (along  with  observed  RTTs).  The  error  metrics  are  given 
in  Table  5.  We  see  that  there  is  no  difference  between  Tables  5  and  3, 
which  is  to  be  expected  for  lightly  loaded  conditions.  Figures  9  and  10 
show  the  values  of  SRTT  and  RXT  for  experiment  2.  The  error  metrics  are 
given  in  Table  6.  Comparing  Tables  6  and  4,  we  observe  that  no  significant 
improvement  is  achieved  by  measuring  the  RTTs  of  more  packets. 


(one column per value of α; the column headings are illegible)

clock res.    MSE (-, +)       MSE (-, +)       MSE (-, +)       MSE (-, +)
500 ms        463 (6, 463)     363 (32, 361)    362 (32, 360)    84 (72, 44)
10 ms         102 (25, 99)     101 (27, 97)     106 (28, 102)    111 (30, 107)

Table 5: Experiment 1 with RTTs measured of all possible packets


RXT  estimator  suggested  in  the  TCP  specification 

The TCP specification [14] suggests that the retransmission timeout be
calculated as

RXT = 2 SRTT

We next give the error metrics assuming this estimator, for different
values of α, with α' = α. Table 7 gives the values for experiment 1 when
roundtrip time is measured only for non-overlapping packets. Table 8 gives
the values for experiment 1 when roundtrip time is measured for all possible
packets. Table 9 gives the values for experiment 2 when roundtrip time is
measured only for non-overlapping packets. Table 10 gives the values for
experiment 2 when roundtrip time is measured for all possible packets.

We  notice  that  RXT  is  considerably  higher  than  RTTs,  irrespective  of  the 
resolution  of  the  clock  measuring  the  RTTs,  and  of  the  number  of  packets 
whose  roundtrip  time  is  measured.  When  we  compare  these  Tables  with 
Tables  3-6,  we  see  that  this  estimator  is  worse  than  Van  Jacobson’s  estimator 
[7],  which  is  currently  used  in  UNIX. 
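
For comparison, the estimator RXT = 2 SRTT can be simulated in the same way. This is a minimal sketch; the function name, the initialization of SRTT from the first sample, and the default gain are assumptions made for illustration.

```python
def simulate_spec_rxt(rtts, alpha=0.875, beta=2.0):
    """Return RXT = beta * SRTT in force when each packet is sent.

    SRTT is an exponentially weighted average of the RTT samples:
        SRTT <- alpha * SRTT + (1 - alpha) * RTT
    """
    srtt = float(rtts[0])  # assumption: start from the first sample
    rxts = []
    for rtt in rtts:
        rxts.append(beta * srtt)  # estimate used for this packet
        srtt = alpha * srtt + (1 - alpha) * rtt
    return rxts


if __name__ == "__main__":
    print(simulate_spec_rxt([100, 200, 200], alpha=0.5))
```

Because the timeout is always twice the smoothed roundtrip time, it stays well above a steady RTT regardless of how little the samples vary, which is consistent with the large over-estimation errors reported for this estimator.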


(one column per value of α; the column headings are illegible)

clock res.    MSE (-, +)       MSE (-, +)       MSE (-, +)       MSE (-, +)
500 ms        686 (0, 686)     544 (2, 544)     (remaining entries illegible)
10 ms         555 (0, 555)     550 (0, 550)     550 (1, 550)     549 (1, 549)

Table 7: Expt. 1 with RXT = 2 SRTT and RTTs of non-overlapping packets


(one column per value of α; the column headings are illegible)

clock res.    MSE (-, +)       MSE (-, +)       MSE (-, +)       MSE (-, +)
500 ms        686 (0, 686)     629 (0, 629)     (remaining entries illegible)
10 ms         549 (1, 549)     (remaining entries illegible)

Table 8: Expt. 1 with RXT = 2 SRTT and RTTs of all possible packets


(one column per value of α; the column headings are illegible)

clock res.    MSE (-, +)       MSE (-, +)       MSE (-, +)       MSE (-, +)
500 ms        1109 (0, 1109)   800 (0, 800)     725 (0, 725)     627 (0, 627)
10 ms         673 (0, 673)     674 (0, 674)     675 (0, 675)     676 (0, 676)

Table 9: Expt. 2 with RXT = 2 SRTT and RTTs of non-overlapping packets


(one column per value of α; the column headings are illegible)

clock res.    MSE (-, +)       MSE (-, +)       MSE (-, +)       MSE (-, +)
500 ms        1224 (0, 1224)   [illegible]      748 (0, 748)     662 (0, 662)
10 ms         [two entries illegible]           697 (0, 697)     695 (0, 695)

Table 10: Expt. 2 with RXT = 2 SRTT and RTTs of all possible packets

5  Conclusion


In this report, we have described an instrumentation that can monitor
selected TCP connections. The instrumentation scheme is designed to collect
information at different interfaces in a TCP/IP implementation. The
current version is implemented in 4.3BSD UNIX.

The  instrumentation  provides  information  about  various  performance 
measures  and  internal  variables  of  an  implementation.  This  can  be  useful 
in  better  understanding  the  working  of  the  implementation,  which  in  turn 
can  help  in  determining  optimal  policies  for  TCP. 

We have used the instrumentation to study the effect of different
roundtrip-time estimators. From the results presented in this paper, it is
clear that a high-resolution clock is essential to obtain good estimates. It
also appears that the RXT estimator suggested by Van Jacobson [7] performs
better than the one suggested in the TCP specification [14].

Elsewhere [16], we have used our instrumentation to find the number
of retransmissions, packets in transit, loss rate, etc., to study response time
versus packet size, and to validate analytic models [1]. We believe that the
instrumentation described here can be applied to any communication
protocol, to test the protocol, to measure its performance, and to validate
analytic models. The statistics provided by such instrumentation would be a
good reference point for comparing TCP to other transport protocols. We are
planning to instrument ISO protocols in the next version of BSD UNIX.

In the future, one can think of “log servers,” just like file servers or
remote login servers. A log server would allow a remote client to establish
a connection, send or receive data according to a specified traffic pattern,
generate a local trace, and ship the trace over to the client at the end of the
experiment. This would allow TCP entities that do not have
instrumentation to evaluate the performance of their policies.

[Figure: ... and RXTs vs Packet Numbers, assuming 500ms clock resolution in RTT ...]

[Figure 4. Experiment 2: Observed ... and RXTs vs Packet Numbers, ... values.]

[Figure 7. Experiment 2: Error metrics vs α, with α ...]

[Figure 9. Experiment 2 with simulated SRTTs and RXTs vs Packet Numbers, using RTTs of all possible packets, assuming 500ms clock resolution in RTT values.]